[misc][torchair] fix bugs around deepseek mtp, enable_shared_expert_dp and use_cached_kv_cache_bytes #3074
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request addresses a bug in DeepSeek's Multi-Token Prediction (MTP) when enable_shared_expert_dp is active. The core of the fix involves conditionally using TorchairDeepSeekMTP for this configuration. Additionally, the logic for handling attention metadata has been updated to be more robust by avoiding a hardcoded dictionary key. My review points out a potential crash in this new metadata handling and provides a suggestion for a safer implementation.
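For context on the reviewer's suggestion, a safer lookup than a hardcoded layer key could look roughly like the sketch below. This is a minimal, hypothetical illustration: the function name, the dict-or-object shape, and the example layer key are assumptions, not the actual vllm-ascend code.

```python
# Hypothetical sketch: fetch a representative attention metadata entry
# without hardcoding a specific layer name as the dictionary key.
from typing import Any, Optional


def first_attn_metadata(attn_metadata: Any) -> Optional[Any]:
    # The runner may pass a single metadata object, or a dict keyed by
    # layer name such as "model.layers.0.self_attn.attn".
    if isinstance(attn_metadata, dict):
        if not attn_metadata:
            # Empty dict (e.g. a dummy/profile run): return None instead
            # of crashing with a KeyError on a hardcoded key.
            return None
        # Take the first entry rather than indexing a fixed layer key,
        # which breaks when layer naming or ordering changes.
        return next(iter(attn_metadata.values()))
    return attn_metadata
```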
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Force-pushed from 9669e03 to b359070
…ange in NPUModelRunner Signed-off-by: linfeng-yuan <1102311262@qq.com>
…_cache_bytes disabled Signed-off-by: linfeng-yuan <1102311262@qq.com>
Hello! You've invoked me with /gemini review.
Code Review
This pull request introduces several targeted fixes for bugs related to deepseek_mtp, enable_shared_expert_dp, and KV cache memory management. The changes correctly adjust MoE state handling and metadata synchronization for shared_expert_dp, prevent an unnecessary reduction in available KV cache memory when use_cached_kv_cache_bytes is disabled, and improve the robustness of accessing attention metadata. The fixes appear correct and well-aligned with the stated objectives. I have no major concerns with this pull request.
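To make the KV cache point concrete, the guard might look like the sketch below. This is a hedged illustration only; the config fields use_cached_kv_cache_bytes and kv_cache_safety_margin_bytes and the function name are placeholders, not names taken from the repository.

```python
# Hypothetical sketch: only shrink the measured KV cache budget by a safety
# margin when cached kv_cache_bytes are actually being reused.
from dataclasses import dataclass


@dataclass
class KVCacheMemoryConfig:
    use_cached_kv_cache_bytes: bool
    kv_cache_safety_margin_bytes: int  # reserve applied to cached measurements


def adjust_available_kv_cache_bytes(measured_bytes: int,
                                    cfg: KVCacheMemoryConfig) -> int:
    if cfg.use_cached_kv_cache_bytes:
        # A cached measurement may be slightly stale, so keep a margin.
        return max(0, measured_bytes - cfg.kv_cache_safety_margin_bytes)
    # Caching disabled: the memory was just profiled, so avoid an
    # unnecessary reduction of the available KV cache memory.
    return measured_bytes
```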
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Force-pushed from 1d27f71 to 6fd21e8
@wangxiyuan I pushed a new commit to tackle the last problem with large EP deployments and to fix the UT break. Please review this commit and check whether there is any blocking problem to merging this PR.
What this PR does / why we need it?
This miscellaneous PR contains several small fixes:
1. Fix deepseek mtp with shared_expert_dp enabled.
2. Avoid reducing the available KV cache memory when use_cached_kv_cache_bytes is disabled.
3. Change fused_moe_state from MC2 to All2All, since the padding logic of mc2_mask is incompatible with the input hidden_states when shared_expert_dp is enabled (an illustrative sketch of this fallback appears at the bottom of this page).

Once this PR is merged, users can launch disaggregated_prefill deployments (large_ep) with deepseek_mtp and shared_expert_dp, as on the v0.9.1-dev branch. The remaining problem of the decline in kv_cache tokens compared to v0.9.1-dev will be resolved by #3073.

Does this PR introduce any user-facing change?
No.
How was this patch tested?
E2E vLLM serving of deepseek_mtp in torchair graph mode, and of enable_shared_expert_dp in eager mode. Large EP deployments were also tested with this PR.
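As a rough, non-authoritative sketch of the third fix listed in the description above: when shared_expert_dp is enabled, the fused MoE communication state can be forced from MC2 to All2All, since the mc2_mask padding does not match the padded hidden_states in that mode. The enum and helper below are illustrative assumptions rather than the actual vllm-ascend implementation.

```python
# Hypothetical sketch: fall back from MC2 to All2All when shared_expert_dp
# is enabled, because mc2_mask padding is incompatible with the padded
# hidden_states in that configuration.
from enum import Enum, auto


class FusedMoEState(Enum):
    MC2 = auto()
    All2All = auto()
    AllGather = auto()


def select_fused_moe_state(preferred: FusedMoEState,
                           enable_shared_expert_dp: bool) -> FusedMoEState:
    if preferred is FusedMoEState.MC2 and enable_shared_expert_dp:
        # The per-rank mc2_mask padding does not line up with the
        # hidden_states padding used when shared experts run with DP,
        # so use All2All for correctness.
        return FusedMoEState.All2All
    return preferred


# Example usage (illustrative):
# select_fused_moe_state(FusedMoEState.MC2, enable_shared_expert_dp=True)
# -> FusedMoEState.All2All
```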